Add CUDA process checkpointing helpers #1983
Conversation
Force-pushed 396a2ca to 7c66b2f
/ok to test

@kkraus14, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

/ok to test 7c66b2f
Force-pushed 779c697 to 82f816c
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test
Force-pushed 82f816c to 25455d8
/ok to test |
Force-pushed 25455d8 to aaf1418
/ok to test |
Force-pushed aaf1418 to d8a2031
/ok to test |
Replace the entire mock-based test suite with real GPU tests that exercise the CUDA driver checkpoint API directly:

- Input validation: pid type/range, public symbol checks
- Lifecycle (single GPU): state transitions at every stage (running→locked→checkpointed→locked→running), restore_thread_id, lock/unlock, lock with timeout, full checkpoint-restore cycle
- GPU migration: rotation mapping and same-chip swap following the r580-migration-api.c pattern; gracefully skip when the driver does not support migration (CUDA_ERROR_INVALID_VALUE, NVBug 5437334)

The self_process fixture wraps os.getpid() and safety-unlocks on teardown if the test fails mid-lifecycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
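For illustration, a minimal sketch of the lifecycle assertions this commit describes, assuming the `cuda.core.checkpoint` module layout from this PR and the CUDA driver's `CUprocessState` enumerator spellings; `check_lifecycle` and the `timeout` keyword are illustrative, not the committed test code:

```python
# Sketch only: enum spellings and the lock timeout keyword are assumptions.
from cuda.bindings import driver
from cuda.core import checkpoint


def check_lifecycle(pid: int) -> None:
    proc = checkpoint.Process(pid)
    assert proc.state == driver.CUprocessState.CU_PROCESS_STATE_RUNNING

    proc.lock(timeout=5.0)  # "lock with timeout" per the commit message
    assert proc.state == driver.CUprocessState.CU_PROCESS_STATE_LOCKED

    proc.checkpoint()  # GPU state is moved into a host-side checkpoint
    assert proc.state == driver.CUprocessState.CU_PROCESS_STATE_CHECKPOINTED

    proc.restore()  # back to locked, per running→locked→checkpointed→locked→running
    assert proc.state == driver.CUprocessState.CU_PROCESS_STATE_LOCKED

    proc.unlock()
    assert proc.state == driver.CUprocessState.CU_PROCESS_STATE_RUNNING
```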
- checkpoint._make_restore_args now accepts UUID strings (as returned by Device.uuid) in addition to CUuuid objects, via a new _as_cuuuid helper that converts "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" strings to CUuuid using ctypes.
- Tests no longer import cuda.bindings.driver; all device queries use cuda.core.Device (Device().uuid for the current device, Device.uuid for mapping keys/values, Device.get_all_devices() for enumeration).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
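A minimal sketch of the string-to-CUuuid conversion this commit describes, assuming `cuda.bindings.driver.CUuuid` exposes `getPtr()` like other binding structs; only the `_make_restore_args` behavior is from the commit, the body here is illustrative:

```python
# Sketch only: the real helper lives in cuda.core.checkpoint and may differ.
import ctypes
import uuid

from cuda.bindings import driver


def _as_cuuuid(value):
    """Accept a CUuuid, or a "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" string
    such as Device.uuid returns, and produce a CUuuid."""
    if isinstance(value, driver.CUuuid):
        return value
    raw = uuid.UUID(str(value)).bytes  # the 16 raw UUID bytes
    cu = driver.CUuuid()
    # Copy the bytes into the driver struct via ctypes, per the commit note.
    ctypes.memmove(int(cu.getPtr()), raw, len(raw))
    return cu
```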
Ruff import sorting, ruff format, and a noqa annotation for the best-effort teardown in the self_process fixture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The swap migration test calls set_current() on a different device. Record the initial device from init_cuda and restore it on teardown so tests are side-effect free.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
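A sketch of the side-effect-free pattern this commit describes, assuming a pytest fixture layered on the suite's existing init_cuda; the fixture name is illustrative:

```python
# Sketch only: record the device that is current at test start and restore
# it on teardown, so a test's set_current() calls leave no side effects.
import pytest

from cuda.core import Device


@pytest.fixture
def preserve_current_device(init_cuda):
    initial = Device()  # device bound to the current context
    yield
    initial.set_current()
```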
Review follow-up notes for signed commit

GitHub accepted the inline reply for the restore-mapping xref thread, but the remaining inline reply attempts are currently returning GitHub server errors, so I am summarizing the responses here.
/ok to test |
@kkraus14, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 8192df6
Copy-pasta from my bot, with internal info redacted.

What I did
Test suite structure (13 tests)
Lessons learned
Pushed e9c03de because the real tests hang in the CI...
/ok to test e9c03de |
cuCheckpointProcessCheckpoint hangs on CI runners (ephemeral VM + container), causing all CUDA 13.x test jobs to time out. Skip the tests that call into the checkpoint driver when the CI environment variable is set. Input validation tests still run everywhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
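A sketch of the guard this commit describes, assuming a module-level pytest marker keyed on the CI environment variable; the marker name is illustrative:

```python
# Sketch only: skip driver-backed checkpoint tests on CI runners.
import os

import pytest

skip_on_ci = pytest.mark.skipif(
    os.environ.get("CI") is not None,
    reason="cuCheckpointProcessCheckpoint hangs on ephemeral CI runners",
)


@skip_on_ci
def test_full_checkpoint_restore_cycle(self_process):
    ...
```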
Force-pushed e9c03de to 8f798f4
/ok to test 8f798f4
@kkraus14 I noticed your force-push (to stringify the tests for subprocess) still includes my WAR (checking the CI env var).
leofang left a comment:
Since I also made some changes, it would be nice for @Andy-Jost to re-review 🙂
Addressed Leo's latest review comments in signed commit …

On CI: the remaining …

Other updates: …

Local validation passed: …
/ok to test 376acc7 |
Summary
- New `cuda.core.checkpoint` module for CUDA process checkpointing APIs, while keeping `cuda.core.system` focused on CUDA system/NVML capabilities
- `checkpoint.Process(pid)`: read-only `pid`, `state`, and `restore_thread_id`, plus `lock`, `checkpoint`, `restore`, and `unlock` on `Process`; the state return type lives in `cuda.core.typing.ProcessStateT` and is rendered in the private API docs
- `Process.state` reports the CUDA driver `CUprocessState` enumerators rather than raw integer values
- `Process.restore(gpu_mapping=...)` accepts `CUuuid` values or `Device.uuid` strings; migration docs and tests now describe the stricter kernel-mode-driver visibility requirement rather than user-space CUDA visibility
- Documentation covers `CAP_SYS_PTRACE`, the CRIU/CPU-process-image boundary, the restore-thread requirement, and the persistence mode/`cuInit` restore requirement
- Capability checks validate the `cuda-bindings` version, required binding symbols, and CUDA driver version, raising `RuntimeError` when unsupported
- Closes #1343
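A hedged usage sketch of the surface summarized above; `target_pid` is a placeholder for a CUDA process the caller may ptrace, and the migration lines are commented out because they replace the plain restore rather than follow it:

```python
# Sketch only: method order follows the documented lifecycle.
from cuda.core import Device, checkpoint

proc = checkpoint.Process(target_pid)  # target_pid: placeholder
print(proc.pid, proc.state, proc.restore_thread_id)

proc.lock()        # quiesce the target's CUDA work
proc.checkpoint()  # move GPU state into a host-side checkpoint
proc.restore()     # bring state back onto the GPU(s)
# Migration variant (instead of the plain restore above): map source GPUs to
# destinations using CUuuid values or Device.uuid strings.
# devs = Device.get_all_devices()
# proc.restore(gpu_mapping={devs[0].uuid: devs[1].uuid})
proc.unlock()      # resume the target
```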
Testing
- `git commit -S` pre-commit hooks: ruff, formatting, SPDX, whitespace, RST, and related checks passed
- `git diff --check`
- `pixi run ruff check cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py` (all checks passed)
- `pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py cuda_core/tests/test_typing_imports.py` (10 passed, 6 skipped)
- `pixi run --manifest-path cuda_core -e docs docs-build-latest` (Sphinx build succeeded)
- `pixi run --manifest-path cuda_core pytest cuda_core/tests --ignore=cuda_core/tests/cython` (2798 passed, 352 skipped, 2 failed)

The two Python-suite failures in the broader run are existing local NVML/system environment failures and are not related to this checkpointing change:
- `cuda_core/tests/system/test_system_device.py::test_get_inforom_version` returns an empty InfoROM board part number locally.
- `cuda_core/tests/system/test_system_system.py::test_get_process_name` hits an NVML UTF-8 decode error locally.

Additional local build/test notes:
- `pixi run --manifest-path cuda_core test` stops before pytest in the existing `build-cython-tests` pre-step because `cuda_core/tests/cython/test_get_cuda_native_handle.pyx` cannot find the expected `cuda.bindings` `.pxd` files in this local pixi environment.
- `pixi build` from `cuda_core` reaches the existing native cuda-core extension build and then fails with CUDA 12.9 headers that do not declare `CU_MEM_ALLOCATION_TYPE_MANAGED`; this is in the existing graph extension build path and is not checkpoint-specific.

CI note:
- `8192df67` exposed two unrelated runner issues: one Windows py3.12 build failed inside the shared mini-CTK cache setup before any cuda.core build step, and CUDA 13.x GPU test jobs were canceled after the old in-process checkpoint test hung in `cuCheckpointProcessCheckpoint`.
- This replaces the blanket `CI` skip: driver-backed checkpoint lifecycle tests now run through isolated subprocess coordinator/target scenarios, and the parent pytest process can kill and skip a scenario that times out instead of letting the CI job hang (sketched below).
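A sketch of the coordinator pattern in the CI note, assuming scenarios are stringified Python run in a fresh interpreter; `run_scenario` and the timeout value are illustrative:

```python
# Sketch only: the parent pytest process kills and skips a hung scenario
# instead of letting the CI job time out.
import subprocess
import sys

import pytest


def run_scenario(script: str, timeout_s: float = 60.0) -> str:
    proc = subprocess.Popen(
        [sys.executable, "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed child
        pytest.skip("checkpoint scenario timed out (driver hang)")
    assert proc.returncode == 0, out
    return out
```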
Current Test Implementation

The checkpoint tests in `cuda_core/tests/test_checkpoint.py` are real driver/GPU tests, not broad mocks.

Input validation and public-symbol checks run everywhere. Driver-backed lifecycle tests create a target process that initializes a real CUDA context, then a coordinator scenario calls `checkpoint.Process(target.pid)` and exercises `state`, `restore_thread_id`, `lock`, `checkpoint`, `restore`, and `unlock` through the real driver. The parent pytest process enforces a timeout around each scenario so unsupported driver/hardware paths skip cleanly instead of hanging the test job.

Migration tests require at least two same-chip GPUs and an unmasked CUDA device view. They build full UUID mappings using `Device.uuid` strings, then exercise rotation and pair-swap migration patterns through `Process.restore(gpu_mapping=...)` in the isolated target process. They skip gracefully when `CUDA_VISIBLE_DEVICES` is set, when the local hardware lacks a same-chip GPU pair, or when the driver rejects/no-ops checkpoint migration.
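A sketch of the two mapping shapes described above, built from `Device.uuid` strings; the helper names are illustrative:

```python
# Sketch only: rotation maps each GPU to the next (wrapping around);
# pair swap exchanges a same-chip pair in both directions.
from cuda.core import Device


def rotation_mapping() -> dict[str, str]:
    uuids = [d.uuid for d in Device.get_all_devices()]
    return {u: uuids[(i + 1) % len(uuids)] for i, u in enumerate(uuids)}


def swap_mapping(a: Device, b: Device) -> dict[str, str]:
    return {a.uuid: b.uuid, b.uuid: a.uuid}


# e.g. proc.restore(gpu_mapping=rotation_mapping())
```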